tutorials/036 - Distributing Calls with Glue Interactive Sessions on Ray.ipynb (169 lines of code) (raw):
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"trusted": true
},
"source": [
"[AWS SDK for pandas](https://github.com/aws/aws-sdk-pandas)\n",
"\n",
"# 36 - Distributing Calls on Glue Interactive Sessions\n",
"\n",
"AWS SDK for pandas is pre-loaded into [AWS Glue interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/is-using-ray.html) with the Ray kernel, which makes them a convenient way to experiment with the library at scale."
]
},
{
"cell_type": "markdown",
"metadata": {
"editable": true,
"trusted": true
},
"source": [
"In AWS Glue Studio, choose `Notebook` to create an AWS Glue interactive session.\n",
"\n",
"Then select `Ray` as the kernel. The IAM role must trust the AWS Glue service principal.\n",
"\n",
"Once the notebook is up and running, install `awswrangler` and `modin` as additional Python modules, then import the library."
]
},
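{
"cell_type": "markdown",
"metadata": {},
"source": [
"The session in this tutorial runs on 5 workers of type `Z.2X`, as reported in the session output further down. A minimal sketch of how those settings could be requested explicitly with AWS Glue interactive sessions magics, assuming the magics are run before the session is created. The `%idle_timeout` value of 60 minutes is an illustrative assumption, not a setting taken from this tutorial, and `%glue_ray` may be redundant when the Ray kernel is already selected:\n",
"\n",
"```text\n",
"%glue_ray\n",
"%worker_type Z.2X\n",
"%number_of_workers 5\n",
"%idle_timeout 60\n",
"```\n",
"\n",
"Configuration of this kind is read when the session starts, so it would normally sit in the first cells of the notebook, next to `%additional_python_modules`."
]
},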
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"tags": [],
"trusted": true,
"vscode": {
"languageId": "python"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Additional python modules to be included:\n",
"awswrangler\n",
"modin\n"
]
}
],
"source": [
"%additional_python_modules awswrangler,modin"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [],
"trusted": true,
"vscode": {
"languageId": "python"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::463623607974:role/service-role/AmazonSageMakerServiceCatalogProductsGlueRole\n",
"Trying to create a Glue session for the kernel.\n",
"Worker Type: Z.2X\n",
"Number of Workers: 5\n",
"Session ID: 32566e82-34d2-4db7-adac-cbee573e20bf\n",
"Job Type: glueray\n",
"Applying the following default arguments:\n",
"--glue_kernel_version 0.38.1\n",
"--enable-glue-datacatalog true\n",
"--auto-scaling-ray-min-workers 1\n",
"--additional-python-modules awswrangler,modin\n",
"Waiting for session 32566e82-34d2-4db7-adac-cbee573e20bf to get into ready status...\n",
"Session 32566e82-34d2-4db7-adac-cbee573e20bf has been created.\n"
]
}
],
"source": [
"import awswrangler as wr"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": [],
"trusted": true,
"vscode": {
"languageId": "python"
}
},
"outputs": [],
"source": [
"df = wr.s3.read_parquet(path=\"s3://ursa-labs-taxi-data/2017/\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": [],
"trusted": true,
"vscode": {
"languageId": "python"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" vendor_id pickup_at ... improvement_surcharge total_amount\n",
"0 1 2017-01-09 11:13:28 ... 0.3 15.300000\n",
"1 1 2017-01-09 11:32:27 ... 0.3 7.250000\n",
"2 1 2017-01-09 11:38:20 ... 0.3 7.300000\n",
"3 1 2017-01-09 11:52:13 ... 0.3 8.500000\n",
"4 2 2017-01-01 00:00:00 ... 0.3 52.799999\n",
"\n",
"[5 rows x 17 columns]\n"
]
}
],
"source": [
"df.head()"
]
},
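{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the experiment is finished, the running session can also be ended from within the notebook. A minimal sketch, assuming the `%stop_session` magic of AWS Glue interactive sessions is available in this kernel:\n",
"\n",
"```text\n",
"%stop_session\n",
"```\n",
"\n",
"Stopping the session does not remove the notebook itself, so the cleanup note below still applies."
]
},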
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-warning\">\n",
"To avoid incurring a charge, make sure to delete the Jupyter Notebook when you are done experimenting.\n",
"</div>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Glue PySpark",
"language": "python",
"name": "glue_pyspark"
},
"language_info": {
"codemirror_mode": {
"name": "python",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "Python_Glue_Session",
"pygments_lexer": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}